Introduction

New England groundfish have been harvested commercially for more than 400 years. However, in the last thirty years, stocks of several important species have declined severely with a few at historically low levels. The Northeast Fisheries Science Center (NEFSC) bottom trawl survey is an essential component for stock assessment and fishery management in New England. However, many stock assessments rely on uncertain estimates of catchability. Commercial fishers’ on the water observations as well as recent studies suggest that the NEFSC bottom trawl survey is not 100% efficient for most of the species it surveys. Differences in catch efficiency and inaccurate estimates of survey catchability can result in added uncertainty for stock assessments, impacting the management process, catch limits, and the economic viability of commercial fisheries. Following other experiments in the region, this study aims to provide estimates of survey efficiency for several species in the multi-species groundfish complex. In the absence of an experimental paired-gear study, an opportunistic approach was used to compare the relative efficiency of survey gears to commercial gears. High resolution fisheries dependent data from the NEFSC’s Study Fleet and Observer programs were leveraged to identify tows that occurred in the relative vicinity and timeframe of NEFSC survey tows. Catch ratios between commercial trawl gears and the NEFSC bottom trawl survey were compared and the relative efficiency for the NEFSC survey was estimated. Relative efficiency estimates can be used to help improve estimates of catchability in the stock assessment process.

A first-cut analysis will use paired t-tests or Wiloxcon sign-rank test to assess differences in mean catch between commercial and survey trawl nets. More advanced analysis will be conducted with generalized liner models (GLM) with a binomial distribution to model and compare catch rates. The response variable will be the survey catch divided by the total catch from the commercial trawl and survey trawl for each paired observation for each species of interest. \[ \frac{C_s}{C_s + C_c} \] Explanatory variables will include trawl net characteristics, environmental conditions and seasonal information. The two commercial datasets were investigated separately because they have different information, for example the observer dataset has more information on the type of trawl gear used and detailed gear characteristics.



Results

Study Fleet and NEFSC Pairs

5m and 72hr Paired Tows

Cod

Exploratory Visualization

#Looking at the cpua for the paired tows for each species 
paired_tows_5mi_72hr_re %>%
  filter(SVSPP == 73 & SPECIES_ITIS == 164712) %>% 
  distinct(EFFORT_ID, .keep_all = TRUE) %>%
  filter(sf_species_cpua < 10000) %>%
  filter(nefsc_species_cpua < 10000) %>%
  ggplot() + 
  geom_bar(aes(x = EFFORT_ID, y = sf_species_cpua, fill = "Study Fleet"), stat = "identity") + 
  geom_bar(aes(x = EFFORT_ID, y = -nefsc_species_cpua, fill = "NEFSC"), stat = "identity") + 
  labs(x = "Tow Number", y = CPUA_label, 
       title = "Comparison of CPUA for Atlantic Cod", 
       subtitle = "Pairs within 5 mi and 72 hrs") + 
  scale_fill_manual(name = "Dataset", 
                    values = c("Study Fleet" = "red", "NEFSC" = "blue"), 
                    labels = c("Study Fleet", "NEFSC")) + 
  theme_classic() + 
  theme(axis.text.x=element_blank())

#Looking at the relative efficiency of the study fleet tows compared to the nefsc survey 
paired_tows_5mi_72hr_re %>%
  filter(SVSPP == 73 & SPECIES_ITIS == 164712) %>%
  distinct(EFFORT_ID, .keep_all = TRUE) %>%
  filter(sf_species_cpua < 10000) %>%
  filter(nefsc_species_cpua < 10000) %>%
  ggplot() +
  geom_histogram(aes(x = relative_efficiency), bins = 50, color = "black", fill = "white") + 
  geom_vline(aes(xintercept = 0.5),
            color = "black", linetype = "dashed", size = 1) + 
  geom_vline(aes(xintercept = mean(relative_efficiency)),
            color = "red", linetype = "dashed", size = 1) + 
  labs(x = "Relative Efficiency", y = "Count", 
       title = "Histogram of Relative Efficiency for Atlantic Cod", 
       subtitle = "Pairs within 5 mi and 72 hrs")

#Map of the matched tows 
paired_tows_5mi_72hr_re %>%
  filter(SVSPP == 73 & SPECIES_ITIS == 164712) %>%
  distinct(EFFORT_ID, .keep_all = TRUE) %>%
  filter(sf_species_cpua < 10000) %>%
  filter(nefsc_species_cpua < 10000) %>%
  ggplot() + 
  geom_sf(data = nefsc_strata, fill = "grey") +
  geom_polygon(data = NEUS, aes(x=long, y = lat, group = group), fill = "darkgreen",  
               color="black",inherit.aes = FALSE)  +
  geom_point(data = cod_paired_tows_5mi_72hr_re, aes(x = END_TOW_LON.x,y = END_TOW_LAT.x, color = "NEFSC")) +
  geom_point(data = cod_paired_tows_5mi_72hr_re, aes(x = END_TOW_LON.y,y = END_TOW_LAT.y, color = "Study Fleet")) +
  
  # geom_segment(data = paired_tows_1mi_24hr, aes(x = START_TOW_LON.x, xend = END_TOW_LON.x,
  #                                               y = START_TOW_LAT.x, yend = END_TOW_LAT.x)) +
  coord_sf(ylim = c(39, 44),  xlim = c(-74, -66)) + 
  scale_colour_manual("", breaks = c("Study Fleet", "NEFSC"), values = c("red", "blue")) +
  labs(title = "Map of Matched Study Fleet and NEFSC Tows for Cod", subtitle = "within 5 mi and 72 hrs")


A comparison of the catch-per-unit area illustrates that the commercial trawls from the study fleet seem to be catching more Cod than the survey trawl for the matched tows. The histogram of the relative efficiency shows that the average relative efficiency is about 0.37, if the survey and commerical trawls were equally efficient it would be 0.5. A map of the matched tows shows that there are three distinct areas where the matched tows occurred: southern New England, Georges Bank, and the Gulf of Maine.


T-Test

#looking at the boxplot and histograms of the catch per unit area 
box.plot(cod_paired_tows_5mi_72hr_re)

hist_kh(cod_paired_tows_5mi_72hr_re)

#log-transforming the data
cod_paired_tows_5mi_72hr_re <- cod_paired_tows_5mi_72hr_re %>%
  mutate(diff = sf_species_cpua - nefsc_species_cpua,
         log_sf_species_cpua = log(sf_species_cpua), 
         log_nefsc_species_cpua = log(nefsc_species_cpua), 
         diff_of_logs = log_sf_species_cpua - log_nefsc_species_cpua)

#looking at the boxplot and histograms of the difference between CPUA
box.plot_diff(cod_paired_tows_5mi_72hr_re)

hist_kh_diff(cod_paired_tows_5mi_72hr_re)

#looking at the boxplot and histograms of the diff_og_logs
box.plot_diff_log(cod_paired_tows_5mi_72hr_re)

hist_kh_diff_log(cod_paired_tows_5mi_72hr_re)

#assumptions of normality and equal variance are met with the log transformed data, 
#so the t-test was conducted with the log-transformed data


#running t-test
cod_ttest <- t.test(cod_paired_tows_5mi_72hr_re$log_sf_species_cpua, cod_paired_tows_5mi_72hr_re$log_nefsc_species_cpua, paired = TRUE)
cod_ttest
## 
##  Paired t-test
## 
## data:  cod_paired_tows_5mi_72hr_re$log_sf_species_cpua and cod_paired_tows_5mi_72hr_re$log_nefsc_species_cpua
## t = 4.5522, df = 96, p-value = 0.00001556
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.4845498 1.2338666
## sample estimates:
## mean of the differences 
##               0.8592082
#back transform the estimate
exp(cod_ttest$estimate)
## mean of the differences 
##                 2.36129
exp(cod_ttest$conf.int)
## [1] 1.623444 3.434484
## attr(,"conf.level")
## [1] 0.95


The boxplot for the data shows that the variance is not equal and the histogram shows that the catch rates for the commercial and survey nets are not normally distributed. However, because the data are paired, two observations (commercial/survey CPUA) from one study unit (‘paired tow’) the difference in catch-per-unit area needs to be investigated. The difference in catch-per-unit area is not normally distributed and is not symmetric, so a log-transformation was performed. The histograms of the log-transformed data show that they are relatively normally distributed and the boxplot shows relative symmetry, so the analysis was conducted with the log-transformed data.

These data provide substantial evidence that the mean difference between commercial and survey catch-per-unit area is non-zero (two-sided p-value < 0.0001 from a paired t-test). The mean for catch rate for Cod for commercial trawls from the study fleet was estimated to be 2.36 times the mean catch rate for the NEFSC survey (95% confidence interval: 1.62 to 3.43 times).


Model Runs

#scatterplot matrix of all the variables of interest 
ggpairs(cod_paired_tows_5mi_72hr_re, columns = c("VESSEL_NAME", "season", "mgmt_area", "AVGDEPTH", "BOTTEMP"), 
        cardinality_threshold = 17)

#running all the models w every combination of variables and comparing them based on AIC and deviance explained 
codm <- fit.models(cod_paired_tows_5mi_72hr_re)
compare.models(codm)
Model AIC deltaAIC Deviance Explained
relative_efficiency ~ VESSEL_NAME +season + AVGDEPTH 144 23 0.24
relative_efficiency ~ VESSEL_NAME +AVGDEPTH 144 23 0.25
relative_efficiency ~ VESSEL_NAME +season 142 22 0.23
relative_efficiency ~ VESSEL_NAME +season + mgmt_area 142 22 0.29
relative_efficiency ~ VESSEL_NAME +mgmt_area 142 21 0.27
relative_efficiency ~ VESSEL_NAME +season + mgmt_area + AVGDEPTH 142 21 0.25
relative_efficiency ~ VESSEL_NAME +mgmt_area + AVGDEPTH 141 20 0.42
relative_efficiency ~ VESSEL_NAME +AVGDEPTH + BOTTEMP 129 9 0.30
relative_efficiency ~ VESSEL_NAME +BOTTEMP 128 7 0.33
relative_efficiency ~ VESSEL_NAME +mgmt_area + BOTTEMP 127 7 0.31
relative_efficiency ~ VESSEL_NAME +mgmt_area + AVGDEPTH + BOTTEMP 125 5 0.32
relative_efficiency ~ VESSEL_NAME +season + mgmt_area + AVGDEPTH + BOTTEMP 123 3 0.42
relative_efficiency ~ VESSEL_NAME +season + mgmt_area + BOTTEMP 122 2 0.43
relative_efficiency ~ VESSEL_NAME +season + AVGDEPTH + BOTTEMP 122 2 0.39
relative_efficiency ~ VESSEL_NAME +season + BOTTEMP 121 0 0.44
#dropping the NAs in the bottom-temp column 
cod_paired_tows_5mi_72hr_re1 <- cod_paired_tows_5mi_72hr_re %>%
  drop_na(BOTTEMP)


#storing the optimal model and plotting the diagnostics
codm_optimal <- glm(relative_efficiency ~ VESSEL_NAME + season + BOTTEMP, family = binomial, data = cod_paired_tows_5mi_72hr_re1)
summary(codm_optimal)
## 
## Call:
## glm(formula = relative_efficiency ~ VESSEL_NAME + season + BOTTEMP, 
##     family = binomial, data = cod_paired_tows_5mi_72hr_re1)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.99051  -0.38422  -0.07438   0.28222   1.59493  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                       6.38231    4.19793   1.520   0.1284  
## VESSEL_NAMEANGELA AND ROSE       -1.84122    2.58070  -0.713   0.4756  
## VESSEL_NAMEBANTRY BAY            -0.57134    1.98281  -0.288   0.7732  
## VESSEL_NAMEDEBBIE SUE             2.59255    2.48559   1.043   0.2969  
## VESSEL_NAMEELIZABETH & KATHERINE -1.85155    2.96040  -0.625   0.5317  
## VESSEL_NAMEELLEN DIANE           -1.92680    3.02801  -0.636   0.5246  
## VESSEL_NAMEILLUSION               2.94936    2.01506   1.464   0.1433  
## VESSEL_NAMEJULIE ANN II           0.48688    2.93234   0.166   0.8681  
## VESSEL_NAMELADY JANE             -0.01325    2.20066  -0.006   0.9952  
## VESSEL_NAMELIGHTNING BAY          1.03748    2.28454   0.454   0.6497  
## VESSEL_NAMELISA ANN II           -0.48460    2.12686  -0.228   0.8198  
## VESSEL_NAMELISA ANN III          -1.13499    2.05536  -0.552   0.5808  
## VESSEL_NAMEMARY K                 0.31465    2.00326   0.157   0.8752  
## VESSEL_NAMEMYSTIC                -0.99968    2.09555  -0.477   0.6333  
## VESSEL_NAMEPROUD MARY            -0.25380    1.84218  -0.138   0.8904  
## VESSEL_NAMESAO PAULO              2.56159    2.31609   1.106   0.2687  
## VESSEL_NAMEVIRGINIA MARISE        1.73632    2.25721   0.769   0.4418  
## seasonSpring                     -4.76016    2.39518  -1.987   0.0469 *
## BOTTEMP                          -0.56756    0.34343  -1.653   0.0984 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34.846  on 87  degrees of freedom
## Residual deviance: 20.237  on 69  degrees of freedom
## AIC: 120.62
## 
## Number of Fisher Scoring iterations: 5
par(mfrow=c(2,2))
plot(codm_optimal)

#back transform the estimates and CIs 
exp(cbind(OR = coef(codm_optimal), confint(codm_optimal)))
##                                             OR          2.5 %          97.5 %
## (Intercept)                      591.292459414 0.162293638272 3538678.9619207
## VESSEL_NAMEANGELA AND ROSE         0.158623367 0.000687165966      54.0605767
## VESSEL_NAMEBANTRY BAY              0.564769973 0.011729491708     100.1315050
## VESSEL_NAMEDEBBIE SUE             13.363852717 0.146915769377    5164.4709266
## VESSEL_NAMEELIZABETH & KATHERINE   0.156994226 0.000001201245      55.5718625
## VESSEL_NAMEELLEN DIANE             0.145613112 0.000244491967     129.7574586
## VESSEL_NAMEILLUSION               19.093682355 0.472936473989    3648.7819642
## VESSEL_NAMEJULIE ANN II            1.627229777 0.005327979453    3419.3671994
## VESSEL_NAMELADY JANE               0.986836257 0.014833851599     225.2333466
## VESSEL_NAMELIGHTNING BAY           2.822093906 0.019280503169     671.7331853
## VESSEL_NAMELISA ANN II             0.615941092 0.006935902171     122.8853938
## VESSEL_NAMELISA ANN III            0.321425362 0.003856911038      59.5855919
## VESSEL_NAMEMARY K                  1.369783894 0.027815718338     249.8621894
## VESSEL_NAMEMYSTIC                  0.367996732 0.006650077343      74.3270091
## VESSEL_NAMEPROUD MARY              0.775843183 0.025269066707     122.6586934
## VESSEL_NAMESAO PAULO              12.956372963 0.172241576370    3764.4715404
## VESSEL_NAMEVIRGINIA MARISE         5.676413856 0.092525554042    1727.4990242
## seasonSpring                       0.008564267 0.000052531766       0.7387508
## BOTTEMP                            0.566907277 0.273567409672       1.0761265
#look at the component residual plots 
crPlots(codm_optimal)

# looking at the effect_plot - these dont work with binomial regression 
# effect_plot(codm_optimal, pred = VESSEL_NAME) + 
#   labs(x = "Vessel Name", y = "Relative Efficiency") + 
#   coord_flip()

# Extract the fitted and predicted values 
cod_paired_tows_5mi_72hr_re1$predict <- predict(codm_optimal)
cod_paired_tows_5mi_72hr_re1$fitted <- fitted(codm_optimal)

#predicted vs observed values 
cod_paired_tows_5mi_72hr_re1 %>%
  ggplot(aes(x = fitted, 
             y = relative_efficiency)) + 
  geom_point() + 
  geom_abline(intercept=0, slope=1) +
  xlim(0,1) + 
  ylim(0,1) + 
  labs(x='Predicted Values', y='Actual Values', title='Predicted vs. Actual Values')

# #predicted vs trawl type w point shape as explanatory variables - not important for this project but saving for my thesis work 
# #mgmt_area
# cod_paired_tows_5mi_72hr_re1 %>%
#   ggplot(aes(x = VESSEL_NAME, 
#              y = fitted, color = mgmt_area)) + 
#   geom_jitter() + 
#   labs(x = "Vessel Name", y = "Fitted Values") + 
#   coord_flip()
# 
# #season
# cod_paired_tows_5mi_72hr_re1 %>%
#   ggplot(aes(x = VESSEL_NAME, 
#              y = fitted, color = season)) + 
#   geom_jitter() + 
#   coord_flip()
# 
# #depth 
# cod_paired_tows_5mi_72hr_re1 %>%
#   ggplot(aes(x = VESSEL_NAME, 
#              y = fitted, color = BOTTEMP)) + 
#   geom_jitter() + 
#   coord_flip()



#getting final RE estimate 
cod_paired_tows_5mi_72hr_re1 %>%
  group_by(VESSEL_NAME) %>%
  summarise(mean(fitted), n = n())
## # A tibble: 17 × 3
##    VESSEL_NAME           `mean(fitted)`     n
##    <chr>                          <dbl> <int>
##  1 ANGELA & ROSE                 0.258      2
##  2 ANGELA AND ROSE               0.322      2
##  3 BANTRY BAY                    0.369     10
##  4 DEBBIE SUE                    0.586      2
##  5 ELIZABETH & KATHERINE         0.0915     2
##  6 ELLEN DIANE                   0.514      1
##  7 ILLUSION                      0.578      8
##  8 JULIE ANN II                  0.658      1
##  9 LADY JANE                     0.505      6
## 10 LIGHTNING BAY                 0.213      3
## 11 LISA ANN II                   0.204      4
## 12 LISA ANN III                  0.126      6
## 13 MARY K                        0.135      9
## 14 MYSTIC                        0.395     21
## 15 PROUD MARY                    0.332      7
## 16 SAO PAULO                     0.596      2
## 17 VIRGINIA MARISE               0.703      2
cod_paired_tows_5mi_72hr_re1 %>%
  summarise(mean(fitted), n = n())
##   mean(fitted)  n
## 1    0.3600107 88
cod_paired_tows_5mi_72hr_re1 %>%
  ggplot(aes(x = VESSEL_NAME, 
             y = fitted)) + 
  geom_jitter() + 
  stat_summary(
    geom = "point",
    fun = "mean",
    col = "black",
    size = 3,
    shape = 24,
    fill = "red"
  ) + 
  labs(x = "Vessel Name", y = "Fitted Values of Relative Efficiency") + 
  geom_hline(yintercept = 0.3600107, linetype = "dashed", color = "red") + 
  coord_flip()


A scatterplot matrix of all the variables of interest showed no correlation among explanatory variables, but this is difficult to assess given that most of the explanatory variables are categorical.

A backward regression analysis was conducted, however no variables were found to be significant so model selection was based on AIC score and the deviance explained.

The optimal model for relative efficiency for Cod from the Study Fleet dataset included vessel, season, and bottom temperature as explanatory variables. The diagnostic plots show no concerning trends, so the interpretation of results proceeded with this optimal model.

The component residual show a strong negative relationship with bottom temperature and higher relative efficiency in the fall.

The most efficient vessels as compared to the reference vessel, the Angela & Rose, after accounting for season and bottom temperature were the Debbie Sue which had a relative efficiency for Cod 13.36 times higher (95% CI: 0.14 to 5164 times), the Illusion which had a relative efficiency 19.10 times higher (95% CI: 0.47 to 3648 times), and the Sao Paulo which had a relative efficiency 12.96 times higher (95% CI: 0.17 to 3764 times).

The modeled values of relative efficiency show that the most efficient vessels compared to the survey trawl were the Mary K, Lisa Ann III, Lisa Ann II, and the Elizabeth & Katherine. The mean of the relative efficiency of the modeled values was 0.36, which is substantially less than if the commerical and survey trawls had equal efficiency



YellowTail Flounder

Exploratory Visualization

#Looking at the cpua for the paired tows for each species 
paired_tows_5mi_72hr_re %>%
  filter(SVSPP == 105 & SPECIES_ITIS == 172909) %>% 
  distinct(EFFORT_ID, .keep_all = TRUE) %>%
  ggplot() + 
  geom_bar(aes(x = EFFORT_ID, y = sf_species_cpua, fill = "Study Fleet"), stat = "identity") + 
  geom_bar(aes(x = EFFORT_ID, y = -nefsc_species_cpua, fill = "NEFSC"), stat = "identity") + 
  labs(x = "Tow Number", y = CPUA_label, 
       title = "Comparison of CPUA for Yellowtail Flounder", 
       subtitle = "Pairs within 5 mi and 72 hrs") + 
  scale_fill_manual(name = "Dataset", 
                    values = c("Study Fleet" = "red", "NEFSC" = "blue"), 
                    labels = c("Study Fleet", "NEFSC")) + 
  theme_classic() + 
  theme(axis.text.x=element_blank())

#filter out the outlier tow and retain only tows with less than 10,000 lbs/km2
paired_tows_5mi_72hr_re <- paired_tows_5mi_72hr_re %>%
  filter(SVSPP == 105 & SPECIES_ITIS == 172909) %>% 
  filter(sf_species_cpua < 10000)

#re run that plot without the outlier 
paired_tows_5mi_72hr_re %>%
  filter(SVSPP == 105 & SPECIES_ITIS == 172909) %>% 
  distinct(EFFORT_ID, .keep_all = TRUE) %>%
  ggplot() + 
  geom_bar(aes(x = EFFORT_ID, y = sf_species_cpua, fill = "Study Fleet"), stat = "identity") + 
  geom_bar(aes(x = EFFORT_ID, y = -nefsc_species_cpua, fill = "NEFSC"), stat = "identity") + 
  labs(x = "Tow Number", y = CPUA_label, 
       title = "Comparison of CPUA for Yellowtail Flounder", 
       subtitle = "Pairs within 5 mi and 72 hrs") + 
  scale_fill_manual(name = "Dataset", 
                    values = c("Study Fleet" = "red", "NEFSC" = "blue"), 
                    labels = c("Study Fleet", "NEFSC")) + 
  theme_classic() + 
  theme(axis.text.x=element_blank())

#filtering it out in the yellowtail species dataset 
yellowtailfl_paired_tows_5mi_72hr_re <- yellowtailfl_paired_tows_5mi_72hr_re %>% 
  filter(sf_species_cpua < 10000, 
         nefsc_species_cpua < 10000)



#Looking at the relative efficiency of the study fleet tows compared to the nefsc survey 
paired_tows_5mi_72hr_re %>%
  filter(SVSPP == 105 & SPECIES_ITIS == 172909) %>%
  distinct(EFFORT_ID, .keep_all = TRUE) %>%
  ggplot() +
  geom_histogram(aes(x = relative_efficiency), bins = 50, color = "black", fill = "white") + 
  geom_vline(aes(xintercept = 0.5),
            color = "black", linetype = "dashed", size = 1) + 
  geom_vline(aes(xintercept = mean(relative_efficiency)),
            color = "red", linetype = "dashed", size = 1) + 
  labs(x = "Relative Efficiency", y = "Count", 
       title = "Histogram of Relative Efficiency for Yellowtail Flounder", 
       subtitle = "Pairs within 5 mi and 72 hrs")

#Map of the matched tows 
paired_tows_5mi_72hr_re %>%
  filter(SVSPP == 105 & SPECIES_ITIS == 172909) %>%
  distinct(EFFORT_ID, .keep_all = TRUE) %>%
  ggplot() + 
  geom_sf(data = nefsc_strata, fill = "grey") +
  geom_polygon(data = NEUS, aes(x=long, y = lat, group = group), fill = "darkgreen",  
               color="black",inherit.aes = FALSE)  +
  geom_point(data = yellowtailfl_paired_tows_5mi_72hr_re, aes(x = END_TOW_LON.x,y = END_TOW_LAT.x, color = "NEFSC")) +
  geom_point(data = yellowtailfl_paired_tows_5mi_72hr_re, aes(x = END_TOW_LON.y,y = END_TOW_LAT.y, color = "Study Fleet")) +
  
  # geom_segment(data = paired_tows_1mi_24hr, aes(x = START_TOW_LON.x, xend = END_TOW_LON.x,
  #                                               y = START_TOW_LAT.x, yend = END_TOW_LAT.x)) +
  coord_sf(ylim = c(39, 44),  xlim = c(-74, -66)) + 
  scale_colour_manual("", breaks = c("Study Fleet", "NEFSC"), values = c("red", "blue")) +
  labs(title = "Map of Matched Study Fleet and NEFSC Tows for Yellowtail Flounder", subtitle = "within 5 mi and 72 hrs")


A comparison of the catch-per-unit area illustrates one large outlier in the Study Fleet dataset. The CPUA was limited to less than 10,000 lbs/km^2. The subsequent plot shows that the commercial trawls from the study fleet seem to catch more yellowtail flounder than the survey trawl in most instances. However, there are some matched tows where the survey trawl outperformed the commercial trawls. The histogram of the relative efficiency shows that the average relative efficiency is slightly lower than if the survey and commercial trawls were equally efficient. A map of the matched tows shows that there are three distinct areas where the matched tows occurred: southern New England, Georges Bank, and the Gulf of Maine. The majority of matched tows are in the Gulf of Maine and the eastern edge of Georges Bank.


T-Test

#looking at the boxplot and histograms of the catch per unit area 
box.plot(yellowtailfl_paired_tows_5mi_72hr_re)

hist_kh(yellowtailfl_paired_tows_5mi_72hr_re)

#log-transforming the data
yellowtailfl_paired_tows_5mi_72hr_re <- yellowtailfl_paired_tows_5mi_72hr_re %>%
  mutate(diff = sf_species_cpua - nefsc_species_cpua,
         log_sf_species_cpua = log(sf_species_cpua), 
         log_nefsc_species_cpua = log(nefsc_species_cpua), 
         diff_of_logs = log_sf_species_cpua - log_nefsc_species_cpua)

#looking at the boxplot and histograms of the difference between CPUA
box.plot_diff(yellowtailfl_paired_tows_5mi_72hr_re)

hist_kh_diff(yellowtailfl_paired_tows_5mi_72hr_re)

#looking at the boxplot and histograms of the diff_og_logs
box.plot_diff_log(yellowtailfl_paired_tows_5mi_72hr_re)

hist_kh_diff_log(yellowtailfl_paired_tows_5mi_72hr_re)

#assumptions of normality and equal variance are met with the log transformed data, 
#so the t-test was conducted with the log-transformed data

#assumptions of normality and equal variance are not met with the log transformed data, 
#so the t-test was conducted with the log-transformed data


yellowtail_ttest <- t.test(yellowtailfl_paired_tows_5mi_72hr_re$log_sf_species_cpua, yellowtailfl_paired_tows_5mi_72hr_re$log_nefsc_species_cpua, 
                           paired = TRUE)
yellowtail_ttest
## 
##  Paired t-test
## 
## data:  yellowtailfl_paired_tows_5mi_72hr_re$log_sf_species_cpua and yellowtailfl_paired_tows_5mi_72hr_re$log_nefsc_species_cpua
## t = 2.3235, df = 119, p-value = 0.02185
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.06482849 0.81246080
## sample estimates:
## mean of the differences 
##               0.4386446
#back transform the estimate
exp(yellowtail_ttest$estimate)
## mean of the differences 
##                1.550604
exp(yellowtail_ttest$conf.int)
## [1] 1.066976 2.253446
## attr(,"conf.level")
## [1] 0.95


The boxplot for the data shows that the variance is relatively equal but the histogram shows that the catch rates for the commercial and survey nets are not normally distributed. However, because the data are paired, two observations (commercial/survey CPUA) from one study unit (‘paired tow’) the difference in catch-per-unit area needs to be investigated. The difference in catch-per-unit area is not normally distributed and is partially symmetric, so a log-transformation was performed. The histograms of the log-transformed data show that they are relatively normally distributed and the boxplot shows relative symmetry, so the analysis was conducted with the log-transformed data.

These data provide moderate evidence that the mean difference between commercial and survey catch rates is non-zero (two-sided p-value = 0.022 from a paired t-test). The mean catch rate for Yellowtail Flounder for commercial trawls from the study fleet was estimated to be 1.55 times the mean catch rate for the NEFSC survey (95% confidence interval: 1.07 to 2.25 times).


Model Runs

#scatterplot matrix of all the variables of interest 
ggpairs(yellowtailfl_paired_tows_5mi_72hr_re, columns = c("VESSEL_NAME", "season", "mgmt_area", "AVGDEPTH", "BOTTEMP"), 
        cardinality_threshold = 17)

#running all the models w every combination of variables and comparing them based on AIC and deviance explained 
yellowtailm <- fit.models(yellowtailfl_paired_tows_5mi_72hr_re)
compare.models(yellowtailm)
Model AIC deltaAIC Deviance Explained
relative_efficiency ~ VESSEL_NAME +mgmt_area 175 34 0.25
relative_efficiency ~ VESSEL_NAME +season 171 30 0.20
relative_efficiency ~ VESSEL_NAME +season + mgmt_area 171 30 0.41
relative_efficiency ~ VESSEL_NAME +AVGDEPTH 160 19 0.25
relative_efficiency ~ VESSEL_NAME +mgmt_area + AVGDEPTH 160 19 0.25
relative_efficiency ~ VESSEL_NAME +season + BOTTEMP 159 18 0.50
relative_efficiency ~ VESSEL_NAME +season + mgmt_area + BOTTEMP 159 18 0.25
relative_efficiency ~ VESSEL_NAME +BOTTEMP 157 16 0.41
relative_efficiency ~ VESSEL_NAME +mgmt_area + BOTTEMP 157 16 0.25
relative_efficiency ~ VESSEL_NAME +season + AVGDEPTH 152 11 0.49
relative_efficiency ~ VESSEL_NAME +season + mgmt_area + AVGDEPTH 152 11 0.50
relative_efficiency ~ VESSEL_NAME +season + AVGDEPTH + BOTTEMP 142 1 0.25
relative_efficiency ~ VESSEL_NAME +season + mgmt_area + AVGDEPTH + BOTTEMP 142 1 0.51
relative_efficiency ~ VESSEL_NAME +AVGDEPTH + BOTTEMP 141 0 0.49
relative_efficiency ~ VESSEL_NAME +mgmt_area + AVGDEPTH + BOTTEMP 141 0 0.51
#dropping the NAs in the bottom-temp column 
yellowtailfl_paired_tows_5mi_72hr_re1 <- yellowtailfl_paired_tows_5mi_72hr_re %>%
  drop_na(BOTTEMP, mgmt_area)



#storing the optimal model and plotting the diagnostics
yellowtailm_optimal <- glm(relative_efficiency ~ VESSEL_NAME + AVGDEPTH + BOTTEMP, family = binomial, data = yellowtailfl_paired_tows_5mi_72hr_re1)
summary(yellowtailm_optimal)
## 
## Call:
## glm(formula = relative_efficiency ~ VESSEL_NAME + AVGDEPTH + 
##     BOTTEMP, family = binomial, data = yellowtailfl_paired_tows_5mi_72hr_re1)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.93173  -0.36474  -0.02228   0.28077   1.23972  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                       1.93508    1.76256   1.098   0.2723   
## VESSEL_NAMEANGELA AND ROSE        1.37423    2.14399   0.641   0.5215   
## VESSEL_NAMEBANTRY BAY            -0.25105    1.61989  -0.155   0.8768   
## VESSEL_NAMEDEBBIE SUE             0.40544    2.55402   0.159   0.8739   
## VESSEL_NAMEELIZABETH & KATHERINE -2.04237    1.84042  -1.110   0.2671   
## VESSEL_NAMEELLEN DIANE            0.83354    2.19646   0.379   0.7043   
## VESSEL_NAMEILLUSION              -0.06687    1.57722  -0.042   0.9662   
## VESSEL_NAMEJULIE ANN II           0.11646    2.80843   0.041   0.9669   
## VESSEL_NAMELADY JANE              0.11676    1.91581   0.061   0.9514   
## VESSEL_NAMELISA ANN II            0.56033    1.80611   0.310   0.7564   
## VESSEL_NAMELISA ANN III           0.49732    1.58225   0.314   0.7533   
## VESSEL_NAMEMARY K                -2.89538    3.53056  -0.820   0.4122   
## VESSEL_NAMEMYSTIC                -0.52740    1.64472  -0.321   0.7485   
## VESSEL_NAMEPROUD MARY            -0.26941    1.62548  -0.166   0.8684   
## VESSEL_NAMESAO PAULO             -0.29239    1.58220  -0.185   0.8534   
## VESSEL_NAMEVIRGINIA MARISE       -0.34158    2.03045  -0.168   0.8664   
## AVGDEPTH                         -0.05070    0.01587  -3.196   0.0014 **
## BOTTEMP                           0.16498    0.11091   1.488   0.1369   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 48.496  on 109  degrees of freedom
## Residual deviance: 24.572  on  92  degrees of freedom
## AIC: 140.99
## 
## Number of Fisher Scoring iterations: 6
par(mfrow=c(2,2))
plot(yellowtailm_optimal)

#back transform the estimates and CIs 
exp(cbind(OR = coef(yellowtailm_optimal), confint(yellowtailm_optimal)))
##                                          OR       2.5 %        97.5 %
## (Intercept)                      6.92461496 0.150825275   270.0393085
## VESSEL_NAMEANGELA AND ROSE       3.95201484 0.053810013   440.0986796
## VESSEL_NAMEBANTRY BAY            0.77798670 0.025398014    29.4390012
## VESSEL_NAMEDEBBIE SUE            1.49996738 0.008510348  2195.2586273
## VESSEL_NAMEELIZABETH & KATHERINE 0.12972094 0.002567317     6.2649896
## VESSEL_NAMEELLEN DIANE           2.30145109 0.029961721   296.6017823
## VESSEL_NAMEILLUSION              0.93532026 0.032928541    33.6029977
## VESSEL_NAMEJULIE ANN II          1.12351817 0.005844444 11940.2174807
## VESSEL_NAMELADY JANE             1.12385454 0.023912281    71.3152628
## VESSEL_NAMELISA ANN II           1.75125236 0.044083064    93.9339220
## VESSEL_NAMELISA ANN III          1.64431084 0.057691507    59.7603846
## VESSEL_NAMEMARY K                0.05527796          NA    11.7432698
## VESSEL_NAMEMYSTIC                0.59013757 0.018606496    23.1199089
## VESSEL_NAMEPROUD MARY            0.76383367 0.024491835    29.0103932
## VESSEL_NAMESAO PAULO             0.74647908 0.025931105    26.8710727
## VESSEL_NAMEVIRGINIA MARISE       0.71064626 0.009365556    52.2138927
## AVGDEPTH                         0.95056375 0.919355115     0.9790631
## BOTTEMP                          1.17937415 0.955484248     1.4881235
#look at the component residual plots 
crPlots(yellowtailm_optimal)

# Extract the fitted and predicted values 
yellowtailfl_paired_tows_5mi_72hr_re1$predict <- predict(yellowtailm_optimal)
yellowtailfl_paired_tows_5mi_72hr_re1$fitted <- fitted(yellowtailm_optimal)

#predicted vs observed values 
yellowtailfl_paired_tows_5mi_72hr_re1 %>%
  ggplot(aes(x = fitted, 
             y = relative_efficiency)) + 
  geom_point() + 
  geom_abline(intercept=0, slope=1) +
  xlim(0,1) + 
  ylim(0,1) + 
  labs(x='Predicted Values', y='Actual Values', title='Predicted vs. Actual Values')

#predicted vs trawl type w point shape as explanatory variables - not important for this project but saving for my thesis work 
# #mgmt_area
# yellowtailfl_paired_tows_5mi_72hr_re1 %>%
#   ggplot(aes(x = VESSEL_NAME,
#              y = fitted, color = mgmt_area)) +
#   geom_jitter() +
#   coord_flip()
# 
# #season
# yellowtailfl_paired_tows_5mi_72hr_re1 %>%
#   ggplot(aes(x = VESSEL_NAME, 
#              y = fitted, color = season)) + 
#   geom_jitter() +  
#   coord_flip()
# 
# #depth 
# yellowtailfl_paired_tows_5mi_72hr_re1 %>%
#   ggplot(aes(x = VESSEL_NAME, 
#              y = fitted, color = AVGDEPTH)) + 
#   geom_jitter() + 
#   coord_flip()




#getting final RE estimate 
yellowtailfl_paired_tows_5mi_72hr_re1 %>%
  group_by(VESSEL_NAME) %>%
  summarise(mean(fitted), n = n())
## # A tibble: 16 × 3
##    VESSEL_NAME           `mean(fitted)`     n
##    <chr>                          <dbl> <int>
##  1 ANGELA & ROSE                 0.456      2
##  2 ANGELA AND ROSE               0.607      2
##  3 BANTRY BAY                    0.402     11
##  4 DEBBIE SUE                    0.664      1
##  5 ELIZABETH & KATHERINE         0.244      7
##  6 ELLEN DIANE                   0.624      2
##  7 ILLUSION                      0.426     15
##  8 JULIE ANN II                  0.754      1
##  9 LADY JANE                     0.768      6
## 10 LISA ANN II                   0.674      4
## 11 LISA ANN III                  0.569      9
## 12 MARY K                        0.0533     2
## 13 MYSTIC                        0.376     24
## 14 PROUD MARY                    0.372      7
## 15 SAO PAULO                     0.357     15
## 16 VIRGINIA MARISE               0.446      2
yellowtailfl_paired_tows_5mi_72hr_re1 %>%
  summarise(mean(fitted), n = n())
##   mean(fitted)   n
## 1    0.4336806 110
yellowtailfl_paired_tows_5mi_72hr_re1 %>%
  ggplot(aes(x = VESSEL_NAME, 
             y = fitted)) + 
  geom_jitter() + 
  stat_summary(
    geom = "point",
    fun = "mean",
    col = "black",
    size = 3,
    shape = 24,
    fill = "red"
  ) + 
  labs(x = "Vessel Name", y = "Fitted Values of Relative Efficiency") + 
  geom_hline(yintercept = 0.4336806, linetype = "dashed", color = "red") + 
  coord_flip()


A scatterplot matrix of all the variables of interest showed no correlation among explanatory variables, but this is difficult to assess given that most of the explanatory variables are categorical. A backward regression analysis was conducted, however no variables were found to be significant so model selection was based on AIC score and the deviance explained. The optimal model for relative efficiency for yellowtail flounder from the Study Fleet dataset included vessel, management area, depth, and bottom temperature as explanatory variables. However, that model did not produce estimates of the management area effect, so the model with vessel, depth, and bottom temperature was selected. The diagnostic plots show no concerning trends, so the interpretation of results proceeded with this optimal model. There was one potential outlier in the diagnostic plots, but further investigation provided no reasoning to exclude the data point so the analysis proceeded with that model. The component residual show a strong negative relationship with depth and a slightly positive relationship with bottom temperature. The most efficient vessels as compared to the reference vessel, the Angela & Rose, after accounting for depth and bottom temperature were the Ellen Diane which had a relative efficiency for yellowtail 2.30 times higher (95% CI: 0.029 to 296 times), the Angela and Rose which had a relative efficiency 3.95 times higher (95% CI: 0.054 to 440 times), and the Lisa Ann II which had a relative efficiency 1.75 times higher (95% CI: 0.044 to 93.93 times). The modeled values of relative efficiency show that the most efficient vessels compared to the survey trawl were the Mary K, the Elizabeth & Katherine, and the Sao Paulo. The mean of the relative efficiency of the modeled values was 0.43, which is less than if the commercial and survey trawls had equal efficiency




Observer and NEFSC Pairs

5m and 72hr Paired Tows

Cod

Exploratory Visualization

paired_tows_5mi_72hr_ob_re %>%
  filter(SVSPP == 73 & NESPP4 == 818) %>% 
  distinct(trip_haul_ID, .keep_all = TRUE) %>%
  ggplot() + 
  geom_bar(aes(x = trip_haul_ID, y = ob_species_cpua, fill = "Observer"), stat = "identity") + 
  geom_bar(aes(x = trip_haul_ID, y = -nefsc_species_cpua, fill = "NEFSC"), stat = "identity") + 
  labs(x = "Tow Number", y = CPUA_label, 
       title = "Comparison of CPUA for Atlantic Cod", 
       subtitle = "Pairs within 5 mi and 72 hrs") + 
  scale_fill_manual(name = "Dataset", 
                    values = c("Observer" = "red", "NEFSC" = "blue"), 
                    labels = c("Observer", "NEFSC")) + 
  theme_classic() + 
  theme(axis.text.x=element_blank())

#Looking at the relative efficiency of the study fleet tows compared to the nefsc survey 
paired_tows_5mi_72hr_ob_re %>%
  filter(SVSPP == 73 & NESPP4 == 818) %>%
  distinct(trip_haul_ID, .keep_all = TRUE) %>%
  ggplot() +
  geom_histogram(aes(x = relative_efficiency), bins = 100, color = "black", fill = "white") + 
  geom_vline(aes(xintercept = 0.5),
            color = "black", linetype = "dashed", size = 1) + 
  geom_vline(aes(xintercept = mean(relative_efficiency)),
            color = "red", linetype = "dashed", size = 1) + 
  labs(x = "Relative Efficiency", y = "Count", 
       title = "Histogram of Relative Efficiency for Atlantic Cod", 
       subtitle = "Pairs within 5 mi and 72 hrs")

#Map of the matched tows 
paired_tows_5mi_72hr_ob_re %>%
  filter(SVSPP == 73 & NESPP4 == 818) %>%
  distinct(trip_haul_ID, .keep_all = TRUE) %>%
  ggplot() + 
  geom_sf(data = nefsc_strata, fill = "grey") +
  geom_polygon(data = NEUS, aes(x=long, y = lat, group = group), fill = "darkgreen",  
               color="black",inherit.aes = FALSE)  +
  geom_point(data = cod_paired_tows_5mi_72hr_ob_re, aes(x = END_TOW_LON.x,y = END_TOW_LAT.x, color = "NEFSC")) +
  geom_point(data = cod_paired_tows_5mi_72hr_ob_re, aes(x = END_TOW_LON.y,y = END_TOW_LAT.y, color = "Observer")) +
  
  # geom_segment(data = paired_tows_1mi_24hr, aes(x = START_TOW_LON.x, xend = END_TOW_LON.x,
  #                                               y = START_TOW_LAT.x, yend = END_TOW_LAT.x)) +
  coord_sf(ylim = c(39, 44),  xlim = c(-74, -66)) + 
  scale_colour_manual("", breaks = c("NEFSC", "Observer"), values = c("blue", "red")) +
  labs(title = "Map of Matched Observer and NEFSC Tows for Cod", subtitle = "within 5 mi and 72 hrs")


A comparison of the catch-per-unit area shows that survey trawl is catching more Cod than the commercial trawls from the Observer dataset in most instances. The histogram of the relative efficiency shows that the average relative efficiency is slightly higher than if the survey and commercial trawls were equally efficient. A map of the matched tows shows that the majority of matched tows are in the Gulf of Maine and the northern edge of Georges Bank.


T-Test

#looking at the boxplot and histograms of the catch per unit area 
box.plot_ob(cod_paired_tows_5mi_72hr_ob_re)

hist_kh_ob(cod_paired_tows_5mi_72hr_ob_re)

#log-transforming the data
cod_paired_tows_5mi_72hr_ob_re <- cod_paired_tows_5mi_72hr_ob_re %>%
  mutate(diff = ob_species_cpua - nefsc_species_cpua,
         log_ob_species_cpua = log(ob_species_cpua), 
         log_nefsc_species_cpua = log(nefsc_species_cpua), 
         diff_of_logs = log_ob_species_cpua - log_nefsc_species_cpua)

#looking at the boxplot and histograms of the difference between CPUA
box.plot_diff(cod_paired_tows_5mi_72hr_ob_re)

hist_kh_diff(cod_paired_tows_5mi_72hr_ob_re)

# #looking at the boxplot and histograms of the diff_og_logs
# box.plot_diff_log(cod_paired_tows_5mi_72hr_ob_re)
# hist_kh_diff_log(cod_paired_tows_5mi_72hr_ob_re)

#assumptions of normality and equal variance are met with the data, 
#so the t-test was conducted with the raw data

cod_ttest_ob <- t.test(cod_paired_tows_5mi_72hr_ob_re$ob_species_cpua, cod_paired_tows_5mi_72hr_ob_re$nefsc_species_cpua,
                              paired = TRUE)
cod_ttest_ob
## 
##  Paired t-test
## 
## data:  cod_paired_tows_5mi_72hr_ob_re$ob_species_cpua and cod_paired_tows_5mi_72hr_ob_re$nefsc_species_cpua
## t = -2.6322, df = 53, p-value = 0.01109
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -296.55729  -40.05751
## sample estimates:
## mean of the differences 
##               -168.3074
# #back transform the estimate
# exp(cod_ttest_ob$estimate)
# exp(cod_ttest_ob$conf.int)


The boxplot for the data shows that the variance is not equal and the histogram shows that the catch rates for the commercial and survey nets are not normally distributed. However, because the data are paired, two observations (commercial/survey CPUA) from one study unit (‘paired tow’) the difference in catch-per-unit area needs to be investigated. The difference in catch-per-unit area is normally distributed and is relatively symmetric, so the analysis was conducted on the raw data.

These data provide moderately strong evidence that the mean difference between commercial and survey catch rates is non-zero (two-sided p-value = 0.01 from a paired t-test). The mean catch rate for Cod for commercial fishermen from the observer dataset was estimated to be 168.31 lb/km^2 less than the mean catch rate for the NEFSC survey trawl (95% confidence interval: 40.06 to 296.56 lb/km^2).


Model Runs

#scatterplot matrix of all the variables of interest 
ggpairs(cod_paired_tows_5mi_72hr_ob_re, columns = c("trawl_type", "season", "mgmt_area", "AVGDEPTH", "BOTTEMP"))

#running all the models w every combination of variables and comparing them based on AIC and deviance explained 
codm <- fit.models_ob(cod_paired_tows_5mi_72hr_ob_re)
compare.models_ob(codm)
Model AIC deltaAIC Deviance Explained
relative_efficiency ~ trawl_type +season + mgmt_area + BOTTEMP 82 9 0.06
relative_efficiency ~ trawl_type +season + BOTTEMP 81 7 0.05
relative_efficiency ~ trawl_type +season + mgmt_area 80 7 0.16
relative_efficiency ~ trawl_type +mgmt_area + BOTTEMP 80 7 0.05
relative_efficiency ~ trawl_type +BOTTEMP 79 5 0.06
relative_efficiency ~ trawl_type +season 78 5 0.16
relative_efficiency ~ trawl_type +season + mgmt_area + AVGDEPTH + BOTTEMP 78 5 0.06
relative_efficiency ~ trawl_type +mgmt_area 78 5 0.16
relative_efficiency ~ trawl_type +season + mgmt_area + AVGDEPTH 77 4 0.05
relative_efficiency ~ trawl_type +season + AVGDEPTH + BOTTEMP 77 3 0.18
relative_efficiency ~ trawl_type +mgmt_area + AVGDEPTH + BOTTEMP 76 3 0.16
relative_efficiency ~ trawl_type +mgmt_area + AVGDEPTH 75 2 0.07
relative_efficiency ~ trawl_type +season + AVGDEPTH 75 2 0.19
relative_efficiency ~ trawl_type +AVGDEPTH + BOTTEMP 74 1 0.18
relative_efficiency ~ trawl_type +AVGDEPTH 73 0 0.19
#storing the optimal model and plotting the diagnostics
codm_ob_optimal <- glm(relative_efficiency ~ trawl_type + AVGDEPTH, family = binomial, data = cod_paired_tows_5mi_72hr_ob_re)
summary(codm_ob_optimal)
## 
## Call:
## glm(formula = relative_efficiency ~ trawl_type + AVGDEPTH, family = binomial, 
##     data = cod_paired_tows_5mi_72hr_ob_re)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4348  -0.3611   0.1784   0.4496   1.0136  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.681908   0.791913  -0.861    0.389
## trawl_type2  0.353189   0.681992   0.518    0.605
## trawl_type3 -0.146080   1.482277  -0.099    0.921
## trawl_type4 -2.013589   2.785924  -0.723    0.470
## AVGDEPTH     0.009004   0.005776   1.559    0.119
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22.828  on 53  degrees of freedom
## Residual deviance: 19.253  on 49  degrees of freedom
## AIC: 73.178
## 
## Number of Fisher Scoring iterations: 4
par(mfrow=c(2,2))
plot(codm_ob_optimal)

#back transform the estimates and CIs 
exp(cbind(OR = coef(codm_ob_optimal), confint(codm_ob_optimal)))
##                    OR      2.5 %    97.5 %
## (Intercept) 0.5056511 0.10065700  2.375254
## trawl_type2 1.4236000 0.38287663  5.751907
## trawl_type3 0.8640889 0.02838505 21.985127
## trawl_type4 0.1335087         NA 11.486530
## AVGDEPTH    1.0090449 0.99795541  1.021112
#look at the component residual plots 
crPlots(codm_ob_optimal)

# Extract the fitted and predicted values 
cod_paired_tows_5mi_72hr_ob_re$predict <- predict(codm_ob_optimal)
cod_paired_tows_5mi_72hr_ob_re$fitted <- fitted(codm_ob_optimal)

#predicted vs observed values 
cod_paired_tows_5mi_72hr_ob_re %>%
  ggplot(aes(x = fitted, 
             y = relative_efficiency)) + 
  geom_point() + 
  geom_abline(intercept=0, slope=1) +
  xlim(0,1) + 
  ylim(0,1) + 
  labs(x='Predicted Values', y='Actual Values', title='Predicted vs. Actual Values')

#predicted vs trawl type w point shape as explanatory variables - not relevant for this project but saving for my thesis work 
# #mgmt_area
# cod_paired_tows_5mi_72hr_ob_re %>%
#   ggplot(aes(x = trawl_type, 
#              y = fitted, color = mgmt_area)) + 
#   geom_jitter()
# 
# #season
# cod_paired_tows_5mi_72hr_ob_re %>%
#   ggplot(aes(x = trawl_type, 
#              y = fitted, color = season)) + 
#   geom_jitter()
# 
# #depth 
# cod_paired_tows_5mi_72hr_ob_re %>%
#   ggplot(aes(x = trawl_type, 
#              y = fitted, color = AVGDEPTH)) + 
#   geom_jitter()
# 
# #bottom temp
# cod_paired_tows_5mi_72hr_ob_re %>%
#   ggplot(aes(x = trawl_type, 
#              y = fitted, color = BOTTEMP)) + 
#   geom_jitter() + 
#   coord_flip()
# 
# #ground cable length  
# cod_paired_tows_5mi_72hr_ob_re %>%
#   ggplot(aes(x = trawl_type, 
#              y = fitted, color = GRCABLEN)) + 
#   geom_jitter() + 
#   coord_flip()
# 
# #sweep diameter
# cod_paired_tows_5mi_72hr_ob_re %>%
#   ggplot(aes(x = trawl_type, 
#              y = fitted, color = SWEEPDIA)) + 
#   geom_jitter() + 
#   coord_flip()





#getting final RE estimate 
cod_paired_tows_5mi_72hr_ob_re %>%
  group_by(trawl_type) %>%
  summarise(mean(fitted), n = n())
## # A tibble: 4 × 3
##   trawl_type `mean(fitted)`     n
##   <fct>               <dbl> <int>
## 1 1                   0.611    37
## 2 2                   0.625    14
## 3 3                   0.469     2
## 4 4                   0.155     1
cod_paired_tows_5mi_72hr_ob_re %>%
  summarise(mean(fitted), n = n())
##   mean(fitted)  n
## 1    0.6005787 54
#plot of trawl type and relative efficiency estimates from the model and the mean for each trawl type 
cod_paired_tows_5mi_72hr_ob_re %>%
  ggplot(aes(x = trawl_type, 
             y = fitted)) + 
  geom_jitter() + 
  stat_summary(
    geom = "point",
    fun = "mean",
    col = "black",
    size = 3,
    shape = 24,
    fill = "red"
  ) + 
  labs(x = "Trawl Type", y = "Fitted Values of Relative Efficiency") + 
  geom_hline(yintercept = 0.4336806, linetype = "dashed", color = "red") + 
  scale_x_discrete(labels=c("1" = "Groundfish Trawls", "2" = "Flatfish Trawls",
                              "3" = "Separator Trawls", "4" = "Eliminator Trawls")) + 
  coord_flip()

cod_paired_tows_5mi_72hr_ob_re %>%
  ggplot(aes(x = AVGDEPTH, y = relative_efficiency, color = trawl_type, pch = trawl_type)) + 
  geom_point() + 
  geom_smooth(method = 'glm', method.args = list(family = 'binomial')) + 
  labs(x = "Average Depth", y = "Relative Efficiency") + 
  theme_classic()


A scatterplot matrix of all the variables of interest showed no correlation among explanatory variables, but this is difficult to assess given that most of the explanatory variables are categorical. A backward regression analysis was conducted, however no variables were found to be significant so model selection was based on AIC score and the deviance explained. The optimal model for relative efficiency for Cod from the Observer dataset included trawl type and depth. The diagnostic plots show no concerning trends, so the interpretation of results proceeded with this optimal model. There was one potential outlier in the diagnostic plots, but further investigation provided no reasoning to exclude the data point so the analysis proceeded with that model. The component residual show a strong positive relationship with depth. The most efficient trawl type was the flatfish trawl (trawl_type 2) which had a relative efficency 1.42 times greater than the reference trawl type, the groundfish trawl after accounting for depth (95% CI: 0.38 to 5.75). The modeled values for relative efficiency show that the mean relative efficiency was greater than 0.5, illustrating that the survey trawl was more efficient that the commercial gears.


YellowTail Flounder

Exploratory Visualization

#Looking at the cpua for the paired tows for each species 
paired_tows_5mi_72hr_ob_re %>%
  filter(SVSPP == 105 & NESPP4 == 1230) %>% 
  distinct(trip_haul_ID, .keep_all = TRUE) %>%
  ggplot() + 
  geom_bar(aes(x = trip_haul_ID, y = ob_species_cpua, fill = "Observer"), stat = "identity") + 
  geom_bar(aes(x = trip_haul_ID, y = -nefsc_species_cpua, fill = "NEFSC"), stat = "identity") + 
  labs(x = "Tow Number", y = CPUA_label, 
       title = "Comparison of CPUA for Yellowtail Flounder", 
       subtitle = "Pairs within 5 mi and 72 hrs") + 
  scale_fill_manual(name = "Dataset", 
                    values = c("Observer" = "red", "NEFSC" = "blue"), 
                    labels = c("Observer", "NEFSC")) + 
  theme_classic() + 
  theme(axis.text.x=element_blank())

#Looking at the relative efficiency of the study fleet tows compared to the nefsc survey 
paired_tows_5mi_72hr_ob_re %>%
  filter(SVSPP == 105 & NESPP4 == 1230) %>%
  distinct(trip_haul_ID, .keep_all = TRUE) %>%
  ggplot() +
  geom_histogram(aes(x = relative_efficiency), bins = 100, color = "black", fill = "white") + 
  geom_vline(aes(xintercept = 0.5),
            color = "black", linetype = "dashed", size = 1) + 
  geom_vline(aes(xintercept = mean(relative_efficiency)),
            color = "red", linetype = "dashed", size = 1) + 
  labs(x = "Relative Efficiency", y = "Count", 
       title = "Histogram of Relative Efficiency for Yellowtail Flounder", 
       subtitle = "Pairs within 5 mi and 72 hrs")

#Map of the matched tows 
paired_tows_5mi_72hr_ob_re %>%
  filter(SVSPP == 105 & NESPP4 == 1230) %>%
  distinct(trip_haul_ID, .keep_all = TRUE) %>%
  ggplot() + 
  geom_sf(data = nefsc_strata, fill = "grey") +
  geom_polygon(data = NEUS, aes(x=long, y = lat, group = group), fill = "darkgreen",  
               color="black",inherit.aes = FALSE)  +
  geom_point(data = yellowtailfl_paired_tows_5mi_72hr_ob_re, aes(x = END_TOW_LON.x,y = END_TOW_LAT.x, color = "NEFSC")) +
  geom_point(data = yellowtailfl_paired_tows_5mi_72hr_ob_re, aes(x = END_TOW_LON.y,y = END_TOW_LAT.y, color = "Observer")) +
  
  # geom_segment(data = paired_tows_1mi_24hr, aes(x = START_TOW_LON.x, xend = END_TOW_LON.x,
  #                                               y = START_TOW_LAT.x, yend = END_TOW_LAT.x)) +
  coord_sf(ylim = c(39, 44),  xlim = c(-74, -66)) + 
  scale_colour_manual("", breaks = c("NEFSC", "Observer"), values = c("blue", "red")) +
  labs(title = "Map of Matched Observer and NEFSC Tows for Yellowtail Flounder", subtitle = "within 5 mi and 72 hrs")


A comparison of the catch-per-unit area shows that survey trawl is catching more yellowtail flounder than the commercial trawls from the Observer dataset in most instances. The histogram of the relative efficiency shows that the average relative efficiency is higher than if the survey and commercial trawls were equally efficient. A map of the matched tows shows that the majority of matched tows are in the Gulf of Maine and the southeastern edge of Georges Bank.


T-Test

#looking at the boxplot and histograms of the catch per unit area 
box.plot_ob(yellowtailfl_paired_tows_5mi_72hr_ob_re)

hist_kh_ob(yellowtailfl_paired_tows_5mi_72hr_ob_re)

#log-transforming the data
yellowtailfl_paired_tows_5mi_72hr_ob_re <- yellowtailfl_paired_tows_5mi_72hr_ob_re %>%
  mutate(diff = ob_species_cpua - nefsc_species_cpua,
         log_ob_species_cpua = log(ob_species_cpua), 
         log_nefsc_species_cpua = log(nefsc_species_cpua), 
         diff_of_logs = log_ob_species_cpua - log_nefsc_species_cpua)

#looking at the boxplot and histograms of the difference between CPUA
box.plot_diff(yellowtailfl_paired_tows_5mi_72hr_ob_re)

hist_kh_diff(yellowtailfl_paired_tows_5mi_72hr_ob_re)

#looking at the boxplot and histograms of the diff_og_logs
box.plot_diff_log(yellowtailfl_paired_tows_5mi_72hr_ob_re)

hist_kh_diff_log(yellowtailfl_paired_tows_5mi_72hr_ob_re)

#assumptions of normality and equal variance are met with the log transformed data, 
#so the t-test was conducted with the log-transformed data

yellowtail_ttest_ob <- t.test(yellowtailfl_paired_tows_5mi_72hr_ob_re$log_ob_species_cpua, yellowtailfl_paired_tows_5mi_72hr_ob_re$log_nefsc_species_cpua,
                              paired = TRUE)
yellowtail_ttest_ob
## 
##  Paired t-test
## 
## data:  yellowtailfl_paired_tows_5mi_72hr_ob_re$log_ob_species_cpua and yellowtailfl_paired_tows_5mi_72hr_ob_re$log_nefsc_species_cpua
## t = -3.1917, df = 50, p-value = 0.002445
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.1536280 -0.2624755
## sample estimates:
## mean of the differences 
##              -0.7080517
#back transform the estimate
exp(yellowtail_ttest_ob$estimate)
## mean of the differences 
##                0.492603
exp(yellowtail_ttest_ob$conf.int)
## [1] 0.3154901 0.7691452
## attr(,"conf.level")
## [1] 0.95
#get the back transformed estimate into a percentage
1 - exp(yellowtail_ttest_ob$estimate)
## mean of the differences 
##                0.507397
1 - exp(yellowtail_ttest_ob$conf.int)
## [1] 0.6845099 0.2308548
## attr(,"conf.level")
## [1] 0.95


The boxplot for the data shows that the variance is relatively equal but the histogram shows that the catch rates for the commercial and survey nets are not normally distributed. However, because the data are paired, two observations (commercial/survey CPUA) from one study unit (‘paired tow’) the difference in catch-per-unit area needs to be investigated. The difference in catch-per-unit area is not normally distributed and is only somewhat symmetric, so a log-transformation was performed. The histograms of the log-transformed data show that they are relatively normally distributed and the boxplot shows relative symmetry, so the analysis was conducted with the log-transformed data.

These data provide strong evidence that the mean difference between commercial and survey catch rates is non-zero (two-sided p-value = 0.0024 from a paired t-test). The mean catch rate for Yellowtail Flounder for commercial trawls from the observer dataset was estimated to be 51% less than the mean catch rate for the NEFSC survey (95% confidence interval: 23% to 68%).


Model Runs

#scatterplot matrix of all the variables of interest 
ggpairs(yellowtailfl_paired_tows_5mi_72hr_ob_re, columns = c("trawl_type", "season", "mgmt_area", "AVGDEPTH", "BOTTEMP"))

#running all the models w every combination of variables and comparing them based on AIC and deviance explained 
yellowtailm_ob <- fit.models_ob(yellowtailfl_paired_tows_5mi_72hr_ob_re)
compare.models_ob(yellowtailm_ob)
Model AIC deltaAIC Deviance Explained
relative_efficiency ~ trawl_type +mgmt_area + AVGDEPTH 76 7 0.26
relative_efficiency ~ trawl_type +season + mgmt_area + AVGDEPTH + BOTTEMP 76 7 0.11
relative_efficiency ~ trawl_type +mgmt_area 74 5 0.10
relative_efficiency ~ trawl_type +mgmt_area + AVGDEPTH + BOTTEMP 74 5 0.12
relative_efficiency ~ trawl_type +season + mgmt_area + BOTTEMP 74 5 0.28
relative_efficiency ~ trawl_type +season + mgmt_area + AVGDEPTH 74 5 0.26
relative_efficiency ~ trawl_type +season + AVGDEPTH + BOTTEMP 73 4 0.26
relative_efficiency ~ trawl_type +AVGDEPTH 73 4 0.13
relative_efficiency ~ trawl_type +mgmt_area + BOTTEMP 72 3 0.19
relative_efficiency ~ trawl_type +season + mgmt_area 72 3 0.15
relative_efficiency ~ trawl_type +AVGDEPTH + BOTTEMP 72 3 0.29
relative_efficiency ~ trawl_type +season + BOTTEMP 71 2 0.28
relative_efficiency ~ trawl_type +season + AVGDEPTH 71 2 0.26
relative_efficiency ~ trawl_type +BOTTEMP 71 2 0.20
relative_efficiency ~ trawl_type +season 69 0 0.29
#storing the optimal model and plotting the diagnostics
yellowtailm_ob_optimal <- glm(relative_efficiency ~ trawl_type + season, family = binomial, data = yellowtailfl_paired_tows_5mi_72hr_ob_re)
summary(yellowtailm_ob_optimal)
## 
## Call:
## glm(formula = relative_efficiency ~ trawl_type + season, family = binomial, 
##     data = yellowtailfl_paired_tows_5mi_72hr_ob_re)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9129  -0.2755   0.1122   0.3819   0.8650  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    1.2829     0.5770   2.223   0.0262 *
## trawl_type2   -0.2699     0.6562  -0.411   0.6809  
## trawl_type3   -0.5475     1.5426  -0.355   0.7227  
## trawl_type4   -1.3349     1.0646  -1.254   0.2099  
## seasonSpring  -1.0825     0.6541  -1.655   0.0979 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 14.805  on 50  degrees of freedom
## Residual deviance: 10.999  on 46  degrees of freedom
## AIC: 69.124
## 
## Number of Fisher Scoring iterations: 4
par(mfrow=c(2,2))
plot(yellowtailm_ob_optimal)

#back transform the estimates and CIs 
exp(cbind(OR = coef(yellowtailm_ob_optimal), confint(yellowtailm_ob_optimal)))
##                     OR      2.5 %    97.5 %
## (Intercept)  3.6069661 1.25307006 12.557712
## trawl_type2  0.7634736 0.20784580  2.803400
## trawl_type3  0.5784082 0.02141001 19.889140
## trawl_type4  0.2631811 0.02928074  2.210882
## seasonSpring 0.3387476 0.08807172  1.182124
#look at the component residual plots 
crPlots(yellowtailm_ob_optimal)

# Extract the fitted and predicted values 
yellowtailfl_paired_tows_5mi_72hr_ob_re$predict <- predict(yellowtailm_ob_optimal)
yellowtailfl_paired_tows_5mi_72hr_ob_re$fitted <- fitted(yellowtailm_ob_optimal)


#predicted vs observed values 
yellowtailfl_paired_tows_5mi_72hr_ob_re %>%
  ggplot(aes(x = fitted, 
             y = relative_efficiency)) + 
  geom_point() + 
  geom_abline(intercept=0, slope=1) +
  xlim(0,1) + 
  labs(x='Predicted Values', y='Actual Values', title='Predicted vs. Actual Values')

#predicted vs trawl type w point shape as explanatory variables - not important for this project but saving for my thesis work 
# #mgmt_area
# yellowtailfl_paired_tows_5mi_72hr_ob_re %>%
#   ggplot(aes(x = trawl_type, 
#              y = fitted, color = mgmt_area)) + 
#   geom_jitter()
# 
# #season
# yellowtailfl_paired_tows_5mi_72hr_ob_re %>%
#   ggplot(aes(x = trawl_type, 
#              y = fitted, color = season)) + 
#   geom_jitter()
# 
# #depth 
# yellowtailfl_paired_tows_5mi_72hr_ob_re %>%
#   ggplot(aes(x = trawl_type, 
#              y = fitted, color = AVGDEPTH)) + 
#   geom_jitter()



#getting final RE estimate 
yellowtailfl_paired_tows_5mi_72hr_ob_re %>%
  group_by(trawl_type) %>%
  summarise(mean(fitted), n = n())
## # A tibble: 4 × 3
##   trawl_type `mean(fitted)`     n
##   <fct>               <dbl> <int>
## 1 1                   0.682    23
## 2 2                   0.578    21
## 3 3                   0.545     2
## 4 4                   0.487     5
yellowtailfl_paired_tows_5mi_72hr_ob_re %>%
  summarise(mean(fitted), n = n())
##   mean(fitted)  n
## 1    0.6146174 51
yellowtailfl_paired_tows_5mi_72hr_ob_re %>%
  ggplot(aes(x = trawl_type, 
             y = fitted)) + 
  geom_jitter() + 
  stat_summary(
    geom = "point",
    fun = "mean",
    col = "black",
    size = 3,
    shape = 24,
    fill = "red"
  ) + 
  labs(x = "Trawl Type", y = "Fitted Values of Relative Efficiency") + 
  geom_hline(yintercept = 0.6146174, linetype = "dashed", color = "red") + 
  scale_x_discrete(labels=c("1" = "Groundfish Trawls", "2" = "Flatfish Trawls",
                              "3" = "Separator Trawls", "4" = "Eliminator Trawls")) + 
  coord_flip()


A scatterplot matrix of all the variables of interest showed no correlation among explanatory variables, but this is difficult to assess given that most of the explanatory variables are categorical. A backward regression analysis was conducted, however no variables were found to be significant so model selection was based on AIC score and the deviance explained. The optimal model for relative efficiency for yellowtail flounder from the Observer dataset included trawl type and season. The diagnostic plots show some patterns of grouping, but that is believed to be due to the model only containing categorical variables, so the interpretation of results proceeded with this optimal model. The component residual show greater relative efficiency in the fall. The most efficient trawl type was the reference trawl (trawl_type 1), the groundfish trawl. The modeled values for relative efficiency show that the mean relative efficiency was greater than 0.5, illustrating that the survey trawl was more efficient that the commercial gears.


Conclusion